High-level Plotting

matplotlib is a relatively low-level plotting package compared to others. It makes very few assumptions about what constitutes good layout (by design), but offers a lot of flexibility, allowing the user to completely customize the look of the output.

On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.


In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)

In [ ]:
normals = pd.Series(np.random.normal(size=10))
normals.plot()

Notice that by default a line plot is drawn, and a light grid is included. All of this can be changed, however:


In [ ]:
normals.cumsum().plot(grid=False)

Similarly, for a DataFrame:


In [ ]:
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                       'gamma': np.random.gamma(1, size=100), 
                       'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:


In [ ]:
variables.cumsum(0).plot(subplots=True, grid=False)

Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:


In [ ]:
variables.cumsum(0).plot(secondary_y='normal', grid=False)

If we would like a little more control, we can use matplotlib's subplots function directly, and manually assign plots to its axes:


In [ ]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i,var in enumerate(['normal','gamma','poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')
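
The two approaches can also be combined: when plot is called with subplots=True, it returns the matplotlib axes it created, so each panel can still be adjusted afterwards. The following is only a minimal sketch on synthetic data; the column names a, b and c are placeholders that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data; the column names are placeholders for this sketch
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])

# One subplot per column; Pandas hands back the array of matplotlib axes
axes = df.cumsum().plot(subplots=True, grid=False, figsize=(6, 6))

# Each axes object can then be customized individually
for ax, name in zip(axes, df.columns):
    ax.set_ylabel('cumsum of ' + name)
```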

Bar plots

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the plot method with a kind='bar' argument.

For this series of examples, let's load up the Titanic dataset:


In [ ]:
titanic = pd.read_excel("../data/titanic.xls", "titanic")
titanic.head()

In [ ]:
titanic.groupby('pclass').survived.sum().plot(kind='bar')

In [ ]:
titanic.groupby(['sex','pclass']).survived.sum().plot(kind='barh')

In [ ]:
death_counts = pd.crosstab([titanic.pclass, titanic.sex], titanic.survived.astype(bool))
death_counts.plot(kind='bar', stacked=True, color=['black','gold'], grid=False)

Another way of comparing the groups is to look at the survival rate, by adjusting for the number of people in each group.


In [ ]:
death_counts.div(death_counts.sum(1).astype(float), axis=0).plot(kind='barh', stacked=True, color=['black','gold'])
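
To make the normalization step concrete, here is a small sketch on a made-up passenger table (not the real Titanic data). It divides each row of a crosstab by its row total, and compares the result with the normalize='index' option of crosstab, which should give the same proportions.

```python
import pandas as pd

# Tiny made-up passenger table, for illustration only
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

counts = pd.crosstab(toy.pclass, toy.survived.astype(bool))

# Divide each row by its total so that every row sums to one
rates = counts.div(counts.sum(axis=1), axis=0)

# crosstab can also perform the row normalization itself
rates_direct = pd.crosstab(toy.pclass, toy.survived.astype(bool),
                           normalize='index')

print(rates)
```

Either table can be passed to plot(kind='barh', stacked=True) in the same way as above.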

Histograms

Frequently it is useful to look at the distribution of data before you analyze it. A histogram is a kind of bar graph that displays the relative frequencies of data values; hence, the y-axis is always some measure of frequency, either raw counts of values or scaled proportions.

For example, we might want to see how the fares were distributed aboard the Titanic:


In [ ]:
titanic.fare.hist(grid=False)

The hist method puts the continuous fare values into bins, trying to make a sensible decision about how many bins to use (or equivalently, how wide the bins are). We can override the default value (10):


In [ ]:
titanic.fare.hist(bins=30)
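
To see what the bins argument changes, we can look at the numbers behind a histogram rather than the picture. This sketch uses NumPy's histogram function on synthetic, right-skewed values as a stand-in for the fares; the gamma parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.gamma(2.0, 10.0, size=500)  # synthetic, right-skewed data

# Default: 10 equal-width bins spanning the range of the data
counts10, edges10 = np.histogram(values)

# Finer resolution: 30 narrower bins over the same range
counts30, edges30 = np.histogram(values, bins=30)

# Number of bins and the width of one bin in each case
print(len(counts10), edges10[1] - edges10[0])
print(len(counts30), edges30[1] - edges30[0])
```

More bins means narrower bars and more detail, at the cost of noisier counts in each bar.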

There are several rules of thumb for determining an "optimal" number of bins, each of which grows with the number of observations in the data series. The next cell implements three of them.


In [ ]:
sturges = lambda n: int(np.log2(n) + 1)
square_root = lambda n: int(np.sqrt(n))
from scipy.stats import kurtosis
doanes = lambda data: int(1 + np.log(len(data)) + np.log(1 + kurtosis(data) * (len(data) / 6.) ** 0.5))

n = len(titanic)
sturges(n), square_root(n), doanes(titanic.fare.dropna())

In [ ]:
titanic.fare.hist(bins=doanes(titanic.fare.dropna()))
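
NumPy also ships named bin-selection rules, which can be queried through histogram_bin_edges. The sketch below runs three of them on synthetic data; note that NumPy's definitions need not match the hand-written lambdas above, so the resulting bin counts may differ from theirs.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.gamma(2.0, 10.0, size=1000)  # synthetic stand-in for a skewed variable

# Ask NumPy for the bin edges under each named rule and count the bins
n_bins = {}
for rule in ['sturges', 'sqrt', 'doane']:
    edges = np.histogram_bin_edges(values, bins=rule)
    n_bins[rule] = len(edges) - 1

print(n_bins)
```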

A density plot is similar to a histogram in that it describes the distribution of the underlying data, but rather than being a pure empirical representation, it is an estimate of the underlying "true" distribution. As a result, it is smoothed into a continuous line plot. We create them in Pandas using the plot method with kind='kde', where kde stands for kernel density estimate.


In [ ]:
titanic.fare.dropna().plot(kind='kde', xlim=(0,600))
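
As a rough illustration of what a kernel density estimate is, the sketch below builds one directly with SciPy's gaussian_kde on synthetic data and checks that the area under the curve is close to one. Treat gaussian_kde with its default bandwidth as an assumption of this sketch, not a statement about exactly what the plot method computes.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.gamma(2.0, 10.0, size=300)  # synthetic positive, skewed data

# Gaussian kernels centered on each observation, default bandwidth rule
kde = gaussian_kde(sample)

# Evaluate the estimated density on a grid wider than the data range
grid = np.linspace(sample.min() - 50, sample.max() + 50, 2000)
density = kde(grid)

# Riemann-sum approximation of the area under the density curve
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```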

Often, histograms and density plots are shown together:


In [ ]:
titanic.fare.hist(bins=doanes(titanic.fare.dropna()), density=True, color='lightseagreen')
titanic.fare.dropna().plot(kind='kde', xlim=(0,600), style='r--')

Here, we had to normalize the histogram (density=True; older matplotlib releases called this argument normed), since the kernel density is normalized by definition (it is a probability distribution).
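
What normalization does to the bar heights can be checked numerically. This sketch uses NumPy's histogram function, whose density=True argument rescales the counts so that the bars have a total area of one; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.gamma(2.0, 10.0, size=500)  # synthetic data

# Raw counts versus normalized heights over the same 20 bins
counts, edges = np.histogram(values, bins=20)
heights, _ = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)

print(counts.sum())              # total number of observations
print((heights * widths).sum())  # total area of the normalized bars
```

Only after this rescaling are the histogram and the density curve on the same vertical scale.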

We will explore kernel density estimates more in the next section.

Boxplots

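A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles; these are typically the quartiles and the lower and upper 5 percent values, though matplotlib's default whiskers instead reach to the most extreme points within 1.5 times the interquartile range of the box.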


In [ ]:
titanic.boxplot(column='fare', by='pclass', grid=False)

You can think of the box plot as viewing the distribution from above. The individual markers beyond the whiskers are "outlier" points that fall outside the extreme quantiles.
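
The quantities behind a boxplot are easy to compute by hand. The sketch below assumes the rule matplotlib uses by default, where points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers; the data are made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up data

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits beyond which points are flagged (assumed 1.5 * IQR rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3)
print(outliers.tolist())
```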

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.


In [ ]:
bp = titanic.boxplot(column='age', by='pclass', grid=False)
for i in [1,2,3]:
    y = titanic.age[titanic.pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y.values, 'r.', alpha=0.2)

When data are dense, a couple of tricks used above help the visualization:

  1. reducing the alpha level to make the points partially transparent
  2. adding random "jitter" along the x-axis to avoid overstriking

Exercise

Using the Titanic data, create kernel density estimate plots of the age distributions of survivors and victims.


In [ ]:
# Write your answer here

Scatterplots

To look at how Pandas does scatterplots, let's reload the baseball sample dataset.


In [ ]:
baseball = pd.read_csv("../data/baseball.csv")
baseball.head()

Scatterplots are useful for data exploration, where we seek to uncover relationships among variables. Here we use the matplotlib function scatter directly (recent versions of Pandas also offer a DataFrame.plot.scatter method).


In [ ]:
plt.scatter(baseball.ab, baseball.h)
plt.xlim(0, 700); plt.ylim(0, 200)

We can add additional information to scatterplots by assigning variables to either the size of the symbols or their colors.


In [ ]:
plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.5)
plt.xlim(0, 700); plt.ylim(0, 200)

In [ ]:
plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=40, cmap='hot')
plt.xlim(0, 700); plt.ylim(0, 200);

To view scatterplots of a large number of variables simultaneously, we can use the scatter_matrix function in Pandas. It generates a matrix of pairwise scatterplots, optionally with histograms or kernel density estimates on the diagonal.


In [ ]:
_ = pd.plotting.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(12,8), diagonal='kde')
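
The return value of scatter_matrix is itself useful: it is a square array of matplotlib axes, one per pair of variables. Here is a small sketch on synthetic data, with the function imported from pandas.plotting, where current Pandas releases keep it; the column names x, y and z are placeholders.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(5)
x = rng.normal(size=200)
df = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(size=200),  # correlated with x
    'z': rng.normal(size=200),          # independent noise
})

# Three variables give a 3 x 3 grid, with histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal='hist', alpha=0.5)
print(axes.shape)
```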

Seaborn

Seaborn is a modern data visualization tool for Python, created by Michael Waskom. An easy way to see how Seaborn can immediately improve your data visualization is to set the plot style using one of its several built-in styles.

Here is a simple Matplotlib plot before Seaborn:


In [ ]:
plt.plot(normals)

Seaborn is conventionally imported using the sns alias.


In [ ]:
import seaborn as sns
sns.set()

plt.plot(normals)

Seaborn works hand-in-hand with pandas to create publication-quality visualizations quickly and easily from DataFrame and Series data.

For example, we can generate kernel density estimates of two sets of simulated data, via the kdeplot function.


In [ ]:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
data.head()

In [ ]:
for col in 'xy':
    sns.kdeplot(data[col], shade=True)

distplot combines a kernel density estimate and a histogram.


In [ ]:
sns.distplot(data['x'])

If kdeplot is provided with two columns of data, it will automatically generate a contour plot of the joint KDE.


In [ ]:
sns.kdeplot(data);

In [ ]:
cdystonia = pd.read_csv("../data/cdystonia.csv", index_col=None)
cdystonia.head()

In [ ]:
cdystonia16 = cdystonia[cdystonia.week==16]

In [ ]:
cmap = {'Placebo':'Reds', '10000U':'Blues'}

for treat in cmap:
    age = cdystonia16[cdystonia16.treat==treat].age
    twstrs = cdystonia16[cdystonia16.treat==treat].twstrs
    
    sns.kdeplot(age, twstrs,
        cmap=cmap[treat], shade=True, shade_lowest=False, alpha=0.3)

Similarly, jointplot will generate a shaded joint KDE, along with the marginal KDEs of the two variables.


In [ ]:
with sns.axes_style('white'):
    sns.jointplot(x="age", y="twstrs", data=cdystonia16, kind='kde');

To explore correlations among several variables, the pairplot function generates pairwise plots, with histograms on the diagonal, and allows a fair bit of customization.


In [ ]:
titanic = titanic[titanic.age.notnull() & titanic.fare.notnull()]

In [ ]:
sns.pairplot(titanic, vars=['age', 'fare', 'pclass', 'sibsp'], hue='survived', palette="muted", markers='+')

Another way of exploring multiple variables simultaneously is to generate trellis plots with FacetGrid.

Let's use the Titanic dataset to create a trellis plot that represents 3 variables at a time. This consists of 2 steps:

  1. Create a FacetGrid object that relates two variables in the dataset in a grid of pairwise comparisons.
  2. Add the actual plot (distplot) that will be used to visualize each comparison.

In [ ]:
g = sns.FacetGrid(titanic, col="sex", row="pclass")
g.map(sns.distplot, 'age')

Using the cervical dystonia dataset, we can simultaneously examine the relationship between age and the primary outcome variable as a function of both the treatment received and the week of the treatment by creating a scatterplot of the data, and fitting a polynomial relationship between age and twstrs:


In [ ]:
g = sns.FacetGrid(cdystonia, col="treat", row="week")
g.map(sns.regplot, 'age', 'twstrs', order=2)
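
The curve drawn in each panel can be thought of as a least-squares quadratic. As a sketch of that idea, assuming an order=2 fit amounts to an ordinary polynomial fit of degree 2, we can reproduce it with NumPy's polyfit on made-up, noise-free data.

```python
import numpy as np

# Made-up quadratic relationship, without noise
x = np.linspace(20, 80, 25)
y = 0.02 * (x - 50) ** 2 + 40

# Least-squares fit of a degree-2 polynomial; highest power comes first
coeffs = np.polyfit(x, y, deg=2)

# Evaluate the fitted curve at the observed x values
fitted = np.polyval(coeffs, x)
print(np.round(coeffs, 3))
```

With real data the points scatter around the curve, and the fitted coefficients summarize the overall trend in each panel.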

References

VanderPlas, J. Data visualization with Seaborn, O'Reilly.